Deploying ChatGLM2-6B and Its WebUI on Windows

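I wanted a GLM2-6B that can access the web, so I looked through the GLM site for related links and found one that fit: ChatGLM-6B-Engineering. I decided to deploy it locally (I dislike language models that do not run on my own machine). There were quite a few pitfalls along the way, so I wrote this tutorial assuming almost zero background.
This article uses venv to create the virtual environment; if you prefer conda, you will need to look elsewhere.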

0x0 Get the code

Download the code from the project's repository and extract it into a folder.
Open a command prompt window and cd to that directory.

rem Go to your working dir.
cd /D D:\path\to\your\workspace

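0x1 Create the environment

Next, create a Python virtual environment (skip this if you really dislike virtual environments; I use venv rather than conda because I dislike conda), then activate it so that the later pip commands install into it:

rem Run python to create venv.
python -m venv .\venv
rem Activate the venv in this window.
.\venv\Scripts\activate.bat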

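Make sure the drive that holds the temp/cache directory has enough free space. If it does not, pip reports that there is not enough space to install, and you need to point it at a directory on a larger drive:

rem Set the cache dir.
set TMPDIR=D:\your\cache\dir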
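To see whether the cache drive is actually the problem, you can check how much space is free on it before installing. This is a minimal sketch that assumes pip creates its build files in Python's temp directory. The function name temp_dir_free_gib is my own.

```python
import shutil
import tempfile


def temp_dir_free_gib() -> float:
    """Return the free space, in GiB, on the drive holding Python's temp directory."""
    # gettempdir() checks the TMPDIR, TEMP and TMP environment variables in that order.
    temp_dir = tempfile.gettempdir()
    usage = shutil.disk_usage(temp_dir)
    return usage.free / 1024 ** 3


if __name__ == "__main__":
    print(f"temp dir:   {tempfile.gettempdir()}")
    print(f"free space: {temp_dir_free_gib():.1f} GiB")
```

Run it in the same window after set TMPDIR=... to confirm that the new directory is picked up.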

Then, in the same window, install the dependencies with pip:

pip install -r requirements.txt

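0x2 Set up the GLM-layer API with the model

At this point the GLM model is basically ready to use.
ChatGLM supports quantization, so it can run on somewhat less powerful machines.
The following is taken from the README.md of ChatGLM-6B:

Hardware requirements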

Quantization level	Minimum GPU memory (inference)	Minimum GPU memory (parameter-efficient fine-tuning)
FP16 (no quantization)	13 GB	14 GB
INT8	8 GB	9 GB
INT4	6 GB	7 GB

If even that is too much, the model can also run on the CPU, which is covered later.
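The inference column of the table can be turned into a small lookup. This is only a sketch of how I read the table; the function name pick_quantization and the "CPU" fallback label are mine, not part of ChatGLM.

```python
def pick_quantization(vram_gb: float) -> str:
    """Map available GPU memory (GB) to the lightest-compression level that fits for inference."""
    if vram_gb >= 13:
        return "FP16"  # no quantization
    if vram_gb >= 8:
        return "INT8"  # corresponds to .quantize(8)
    if vram_gb >= 6:
        return "INT4"  # corresponds to .quantize(4)
    return "CPU"       # below the table's minimum: fall back to CPU inference


if __name__ == "__main__":
    for vram in (24, 8, 6, 4):
        print(vram, "GB ->", pick_quantization(vram))
```

The thresholds are the table's minimums, so treat a card that sits exactly on a boundary as a tight fit.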

from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(4).half().cuda()

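In the code above, quantization is set by .quantize(4), which selects INT4; likewise, .quantize(8) selects INT8.
Using the project's API requires one more change:
it loads the model from a local path, and we switch it to the online model on the Hugging Face Hub.
Edit the model-loading lines in:
.\api.py
Before: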

#tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
#model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(4).half().cuda()
#tokenizer = AutoTokenizer.from_pretrained(r"E:\huggingface\models--THUDM--chatglm-6b\snapshots\a10da4c68b5d616030d3531fc37a13bb44ea814d", trust_remote_code=True)
#model = AutoModel.from_pretrained(r"E:\huggingface\models--THUDM--chatglm-6b\snapshots\a10da4c68b5d616030d3531fc37a13bb44ea814d", trust_remote_code=True).quantize(4).half().cuda()
tokenizer = AutoTokenizer.from_pretrained(r"E:\model\chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained(r"E:\model\chatglm2-6b", trust_remote_code=True).quantize(4).half().cuda()

After:

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(4).half().cuda()
#tokenizer = AutoTokenizer.from_pretrained(r"E:\huggingface\models--THUDM--chatglm-6b\snapshots\a10da4c68b5d616030d3531fc37a13bb44ea814d", trust_remote_code=True)
#model = AutoModel.from_pretrained(r"E:\huggingface\models--THUDM--chatglm-6b\snapshots\a10da4c68b5d616030d3531fc37a13bb44ea814d", trust_remote_code=True).quantize(4).half().cuda()
#tokenizer = AutoTokenizer.from_pretrained(r"E:\model\chatglm2-6b", trust_remote_code=True)
#model = AutoModel.from_pretrained(r"E:\model\chatglm2-6b", trust_remote_code=True).quantize(4).half().cuda()
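Instead of commenting lines in and out, a small helper could choose between a local copy and the Hub ID at run time. This is a sketch of my own, not something in the project: resolve_model_source is a hypothetical name, and E:\model\chatglm2-6b is just the local path from the listing above.

```python
import os

HUB_ID = "THUDM/chatglm2-6b"


def resolve_model_source(local_dir: str, hub_id: str = HUB_ID) -> str:
    """Return local_dir if that folder exists, otherwise the Hub model ID."""
    return local_dir if os.path.isdir(local_dir) else hub_id


if __name__ == "__main__":
    # Prints the local path when it exists, else "THUDM/chatglm2-6b".
    print(resolve_model_source(r"E:\model\chatglm2-6b"))
```

The returned string can be passed as the first argument of both from_pretrained calls.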

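With that change the model is downloaded from the Hub. However, the torch installed so far is the CPU build, so the code above, which calls .cuda(), raises an error. There are two cases:

1. Inference on the CPU:

Change the code above to: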

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(4).half().float()
#tokenizer = AutoTokenizer.from_pretrained(r"E:\huggingface\models--THUDM--chatglm-6b\snapshots\a10da4c68b5d616030d3531fc37a13bb44ea814d", trust_remote_code=True)
#model = AutoModel.from_pretrained(r"E:\huggingface\models--THUDM--chatglm-6b\snapshots\a10da4c68b5d616030d3531fc37a13bb44ea814d", trust_remote_code=True).quantize(4).half().cuda()
#tokenizer = AutoTokenizer.from_pretrained(r"E:\model\chatglm2-6b", trust_remote_code=True)
#model = AutoModel.from_pretrained(r"E:\model\chatglm2-6b", trust_remote_code=True).quantize(4).half().cuda()

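ChatGLM can now run on the CPU.

2. Inference with CUDA:

The code needs no changes, but the torch build does: the one installed now is the CPU build, so uninstall it and install the GPU build.

pip uninstall torch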

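2.1 You do not have CUDA:

If CUDA is not installed, download and install it as follows.
First check which CUDA version your NVIDIA driver supports:

Right-click the NVIDIA Settings icon;
Click NVIDIA Control Panel;
Click "Help" > "System Information";
Click "Components";
Look at "3D Settings" > "NVCUDA64.DLL" > "Product Name".
Most PCs have an x64 CPU by now; if yours is still x86, I do not know what to suggest other than a new computer...
The CUDA version you download must not be higher than the version shown in the product name, because that is the newest version the driver supports. I downloaded the local installer for 11.7.
The latest CUDA is available from NVIDIA's CUDA Toolkit download page.
During installation you can pick the express option; if you are short on disk space, choose custom and select only CUDA, and you can even deselect CUDA's Documentation and Visual Studio integration.
Choose the install location and click Install.
Then continue with 2.2: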

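2.2 You have CUDA:

Go to the torch website, find Install PyTorch, and select the options that match your setup.
Mine were:
OS: Windows
Package: Pip
Language: Python
Compute platform: CUDA 11.7
which gives: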

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

Remember to run the command above inside the virtual environment.
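If you are not sure whether the current window is inside the virtual environment, Python can tell you. This is a minimal sketch for environments created with python -m venv; the function name in_virtualenv is my own.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("inside venv:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If it prints False, activate the environment first and run the pip command again.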

Then verify the installation:

python

In the Python interactive prompt, enter:

import torch
torch.cuda.is_available()

If the output is True, the installation succeeded and everything can be used normally.
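When the check prints False, it helps to know whether the CPU build is still installed or the GPU build simply cannot see a device. The sketch below collects that information in one place; describe_torch is a name I made up, and it returns early if torch is not installed at all.

```python
def describe_torch() -> dict:
    """Summarize the installed torch build and whether CUDA is usable."""
    try:
        import torch
    except ImportError:
        return {"installed": False}

    info = {
        "installed": True,
        "version": torch.__version__,
        # CUDA version the wheel was built against; None on CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

A built_with_cuda of None means the CPU build is still the one installed, so the uninstall and reinstall steps above need to be repeated inside the virtual environment.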

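3. One-click launch with a batch script

Copy all the text from .\venv\Scripts\activate.bat, create api.bat in the working directory, and paste it in.
Script:

...(this part is the content of your activate.bat)
rem Set the window title
title ChatGLM-6B WebUI API
rem Start the GLM-layer API
python .\api.py
pause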

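Next comes the interaction-layer API:

0x3 Set up the interaction-layer API

This part is simple: only the browser engine configuration in the project's plugin needs to change:

Open .\plugins\web.py
Replace every Chrome that is not inside a string with Edge, so that all Chrome configuration becomes Edge configuration
Save
That's it.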
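The "not inside a string" part is easy to get wrong with a plain find-and-replace, so here is a sketch that does it with Python's tokenize module. It only rewrites identifiers, which leaves string literals (such as the UA string) and comments alone. The function name and the sample line are hypothetical; I have not checked them against the actual contents of web.py.

```python
import io
import tokenize


def rename_outside_strings(source: str, old: str = "Chrome", new: str = "Edge") -> str:
    """Replace `old` with `new` inside identifiers only, leaving strings and comments as they are."""
    lines = source.splitlines(keepends=True)
    # Collect identifier (NAME) tokens that contain the old text.
    hits = [
        tok
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.NAME and old in tok.string
    ]
    # Patch from the end of the file backwards so earlier columns stay valid.
    for tok in sorted(hits, key=lambda t: t.start, reverse=True):
        row, col = tok.start
        line = lines[row - 1]
        end = col + len(tok.string)
        lines[row - 1] = line[:col] + tok.string.replace(old, new) + line[end:]
    return "".join(lines)


if __name__ == "__main__":
    sample = (
        "driver = webdriver.Chrome(options=ChromeOptions())\n"
        'ua = "Mozilla Chrome/114"\n'
    )
    print(rename_outside_strings(sample))
```

Review the result by hand afterwards: a blind rename can produce names that do not exist on the Edge side, so treat this as a first pass rather than a finished patch.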

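One-click launch works almost the same way:
Copy all the text from .\venv\Scripts\activate.bat, create front_end.bat in the working directory, and paste it in.
Script:

...(this part is the content of your activate.bat)
rem Set the window title
title ChatGLM-6B WebUI front end
rem Start the interaction-layer API
python .\front_end.py
pause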

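0x4 Set up the front end

There are two front ends: one ships in the project's main branch and has already been downloaded with the code; the other is a separate OpenAI-style front end. I cover them separately:

1. Use the bundled front end

A one-click launcher is all you need:
Copy all the text from .\venv\Scripts\activate.bat, create web.bat in the working directory, and paste it in.
Script:

...(this part is the content of your activate.bat)
rem Set the window title
title ChatGLM-6B gradio demo
rem Start the Gradio front end
python .\gradio_demo.py
pause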

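2. Use the OpenAI-style front end

Download it and extract it into a folder; I usually put it in a new folder under the working directory: .\WebUI
After extracting, download and install node.js 14.21.3.
Switch to the WebUI directory and run:

npm install

When the installation finishes, open src\App.vue and edit it:

Replace every process.env.VUE_APP_API with "http://127.0.0.1:8003"
Save
Then create a web.bat with the following content:

@echo off
rem Replace D:\your\path\to\webui with your WebUI working path
cd /D D:\your\path\to\webui
title ChatGLM-6B Web UI
npm run dev
pause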

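0x5 Fixes

The plugin uses markmap to draw mind maps,
so if markmap is enabled you need to run:

npm install markmap markmap-cli -g

The left sidebar of the OpenAI-style front end shows a title summarizing each topic, but the input is never assigned when the summary is generated. So, right after

async def chat(prompt: str):

add this line:

chat_prompt = prompt # This is temp fix.

I have filed an issue about this; it is up to the author how to handle it.

Edit, 2023-09-03: this bug has been fixed:
A temp fix for sidebar title. 对于侧边栏标题的临时性修复。 by rong-xiaoli · Pull Request #31

When using the web plugin, query results may fail to come back, possibly because of a bug (the request may not even be sent). I have filed an issue: 无法使用网络搜索 ("Cannot use web search") · Issue #32
My workaround is to switch from Chrome to Edge, that is, to change every Chrome except the one in the UA string to Edge (in other words, to replace every Chrome implementation with the Edge one).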

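0x6 Final notes

Start the GLM-layer API first, and start the interaction-layer API only after it has finished starting; the front end can be started at any time;
Check every path in this article against your own setup; do not just Ctrl+C, Ctrl+V and leave it at that.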
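The start-up order in the first note can be automated by waiting until the GLM-layer API accepts TCP connections before launching the next process. This is a sketch of my own, not part of the project. The port in the usage comment is a placeholder, since this article does not state which port api.py listens on.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 300.0, interval: float = 1.0) -> bool:
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Usage sketch (GLM_API_PORT is an assumption; check api.py for the real value):
#   GLM_API_PORT = 8000
#   if wait_for_port("127.0.0.1", GLM_API_PORT):
#       subprocess.run(["python", "front_end.py"])
```

Loading the model can take a while, so the default timeout is deliberately generous.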

容小狸's blog